
Conversation

@runcom
Member

@runcom runcom commented Mar 13, 2019

Signed-off-by: Antonio Murdaca [email protected]

- What I did

Rewrite the daemon as a controller with a sync handler, mainly to deal with transient vs. permanent errors (see the sketch at the end of this description).

effectively close #395
also close #401
close #299

- How to verify it

CI should tell us if this is generally good (hopefully)

- Description for the changelog

/cc @derekwaynecarr
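
Sketch referenced above, not part of the original description: assuming the standard client-go workqueue pattern, a sync handler can treat transient errors as retryable by requeuing with rate limiting, while permanent failures (e.g. an unreconcilable MachineConfig) are surfaced by the handler itself, for example by marking the node degraded. The helper and field names below are placeholders, not the merged MCD code.

    package sketch

    import (
        "fmt"

        utilruntime "k8s.io/apimachinery/pkg/util/runtime"
        "k8s.io/client-go/util/workqueue"
    )

    // daemon is a stand-in for the MCD; only the queue and sync handler matter here.
    type daemon struct {
        queue       workqueue.RateLimitingInterface
        syncHandler func(key string) error
    }

    // processNextWorkItem pops one key from the queue and decides between
    // "retry later" (transient error, rate-limited requeue) and "done" (success).
    // Permanent failures are handled inside syncHandler before it returns.
    func (d *daemon) processNextWorkItem() bool {
        key, quit := d.queue.Get()
        if quit {
            return false
        }
        defer d.queue.Done(key)

        if err := d.syncHandler(key.(string)); err != nil {
            // Transient failure: log it and requeue with exponential backoff.
            utilruntime.HandleError(fmt.Errorf("error syncing %q: %v", key, err))
            d.queue.AddRateLimited(key)
            return true
        }
        // Success: clear the rate-limit history for this key.
        d.queue.Forget(key)
        return true
    }

With that shape, an apiserver blip or other flake just backs off and retries instead of permanently degrading the node.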

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 13, 2019
@runcom
Member Author

runcom commented Mar 13, 2019

/hold

The first commit just adds the backbone of the controller; I'm going to work on error handling and syncs now.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 13, 2019
@runcom runcom force-pushed the mcd-controller branch 15 times, most recently from 466fd20 to b1c8c80 on March 13, 2019 15:30
@runcom
Member Author

runcom commented Mar 13, 2019

AWS issue:

level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.private_routing[3]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.private_routing.3: timeout while waiting for state to become 'success' (timeout: 5m0s)"

/retest

@runcom
Member Author

runcom commented Mar 13, 2019

/retest

@runcom
Member Author

runcom commented Mar 13, 2019

Oh cool, this is passing all e2e already 😄 I guess I'll work on the rollback functionality for partial writes tomorrow then, and test it by fuzzing the calls to the apiserver and creating bad MCs to "degrade" (with a different meaning now) the nodes.
I'm still not sure if I should pull in the "Unreconcilable" vs "Degraded" annotation here. Right now, an error from the reconcilable check marks the node as degraded, but other errors do not (as we wanted, right? @cgwalters @derekwaynecarr)

Member

That's...going to be nontrivial.

Member Author

I may be completely stupid, but isn't this just a matter of inverting the arguments to updateFiles?

	// update files on disk that need updating
	if err := dn.updateFiles(newConfig, oldConfig); err != nil {
		return err
	}

I'm sure I'm missing something obvious, as I haven't thought about this that much yet.

Member Author

Ooh, I actually have a solution: #401

I'll modify the atomic write PR to split the writing into a real transaction and only commit at the very end! I'll try that.
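
For illustration only (this is not the actual #401 implementation): one way to stage the writes and commit only at the very end is to write each file next to its destination under a temporary name and rename everything in a final pass, since rename is atomic on the same filesystem. The ".mcd-staged" suffix below is made up.

    package sketch

    import (
        "io/ioutil"
        "os"
        "path/filepath"
    )

    // stageFile writes content next to its final destination under a temporary
    // name; nothing at the real path changes until commit renames the staged files.
    func stageFile(path string, content []byte, mode os.FileMode) (string, error) {
        staged := path + ".mcd-staged" // hypothetical suffix
        if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
            return "", err
        }
        if err := ioutil.WriteFile(staged, content, mode); err != nil {
            return "", err
        }
        return staged, nil
    }

    // commit renames every staged file onto its final path. If staging failed
    // earlier, rollback is just deleting the leftover *.mcd-staged files.
    func commit(staged map[string]string) error {
        for finalPath, tmpPath := range staged {
            if err := os.Rename(tmpPath, finalPath); err != nil {
                return err
            }
        }
        return nil
    }

Note this still is not a true multi-file transaction: a failure in the middle of commit leaves some files renamed and some not, which is exactly the "indeterminate middle state" discussed below.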

Member

Yeah, that could work. I think where it gets tricky is trying to handle an error when we're in the middle of either writing files or doing a revert. Then we're in an "indeterminate middle state", much like what happens in the middle of a yum/apt update.

But anyway, for now we can probably ignore that and, yeah, just delete any files added by the new config and restore all the files from the old one.
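
A rough sketch of that revert idea, assuming both configs expose their file lists; the type and helper names here are hypothetical, not the MCD functions: delete anything only the new config added, then rewrite every file from the old config.

    package sketch

    import (
        "io/ioutil"
        "os"
    )

    // configFile is a minimal stand-in for an Ignition file entry.
    type configFile struct {
        Path     string
        Contents []byte
        Mode     os.FileMode
    }

    // rollbackFiles reverts from newFiles back to oldFiles: files present only in
    // the new config are removed, then every file from the old config is rewritten
    // so content and mode match the previous state.
    func rollbackFiles(oldFiles, newFiles []configFile) error {
        oldPaths := make(map[string]bool, len(oldFiles))
        for _, f := range oldFiles {
            oldPaths[f.Path] = true
        }
        for _, f := range newFiles {
            if !oldPaths[f.Path] {
                if err := os.Remove(f.Path); err != nil && !os.IsNotExist(err) {
                    return err
                }
            }
        }
        for _, f := range oldFiles {
            if err := ioutil.WriteFile(f.Path, f.Contents, f.Mode); err != nil {
                return err
            }
        }
        return nil
    }

Compare with the logs further down in the thread, where "/etc/chrony.conf" stops being written once the old config is replayed.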

Member Author

I'm ignorant about yum/apt update: how do those handle such a failure in the middle of an update?

@cgwalters
Member

Only skimmed this, offhand looks good to me though!

@cgwalters
Member

So yeah I am hacking on #545
and having this would be super useful as I am in a pitched battle with my cluster to fix this:

E0313 20:45:20.221618  108948 writer.go:91] marking degraded due to: Checking initial state: parsing desired osImageURL: not canonical form: "registry.svc.ci.openshift.org/rhcos/machine-os-content:latest"
I0313 20:45:20.235634  108948 daemon.go:439] Entering degraded state; going to sleep

since I messed up the release payload (now fixed to use @sha256). The issue is that the MachineConfigs still have the old one and the pool is trying to update, but the MCDs are degraded. I can hack around this by editing the MC manually, removing the degraded annotations, and then killing all the MCD pods, but yeah...

@kikisdeliveryservice
Contributor

I'm still not sure if I should pull in the "Unreconcilable" vs "Degraded" annotation here.

@runcom I think the degraded logic should stay the same in this PR, and the unreconcilable/degraded changes should be done in another PR, both for testing and visibility reasons.

@runcom
Member Author

runcom commented Mar 14, 2019

this is what I wanted to see:

14:53:37 [github.com/openshift/installer] ‹master*› oc logs machine-config-daemon-wcq7z
I0314 13:53:24.512251    5620 start.go:51] Version: 4.0.0-alpha.0-33-g5861a434
I0314 13:53:24.514231    5620 start.go:87] starting node writer
I0314 13:53:24.543209    5620 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0314 13:53:24.612352    5620 daemon.go:194] Booted osImageURL: docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670 (400.7.20190306.0)
I0314 13:53:24.618722    5620 daemon.go:264] Managing node: ip-10-0-162-26.us-west-2.compute.internal
I0314 13:53:24.657003    5620 start.go:143] Calling chroot("/rootfs")
I0314 13:53:24.657166    5620 start.go:159] Starting MachineConfigDaemon
I0314 13:53:24.657415    5620 daemon.go:466] Enabling Kubelet Healthz Monitor
E0314 13:53:29.812838    5620 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0314 13:53:29.812838    5620 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0314 13:53:32.819098    5620 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0314 13:53:32.819325    5620 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0314 13:53:35.824806    5620 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0314 13:53:35.824806    5620 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0314 13:53:38.830848    5620 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0314 13:53:38.831090    5620 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I0314 13:53:39.958210    5620 run.go:22] Running captured: rpm-ostree status
I0314 13:53:39.991975    5620 daemon.go:724] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670
              CustomOrigin: Provisioned from oscontainer
                   Version: 400.7.20190306.0 (2019-03-06T22:16:26Z)
I0314 13:53:39.992654    5620 daemon.go:659] Current config: worker-461203d1a5a2360f0341a7fb6d698076
I0314 13:53:39.992679    5620 daemon.go:660] Desired config: worker-b449b6e6352b6718d313ee6291cc0b43
I0314 13:53:39.992832    5620 daemon.go:1005] Current and target osImageURL have matching digest "sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670"
I0314 13:53:39.994155    5620 daemon.go:780] Validated on-disk state
I0314 13:53:40.009343    5620 daemon.go:806] Completing pending config worker-b449b6e6352b6718d313ee6291cc0b43
I0314 13:53:40.016509    5620 update.go:690] machine-config-daemon: completed update for config worker-b449b6e6352b6718d313ee6291cc0b43
I0314 13:53:40.025660    5620 daemon.go:812] In desired config worker-b449b6e6352b6718d313ee6291cc0b43

retry after network failures :)

@runcom
Member Author

runcom commented Mar 14, 2019

Tested this on both upgrade and installation and it's amazing. With this we can also recover from a failed reconcile by deleting the bad MachineConfig and creating a new one.

@runcom
Member Author

runcom commented Mar 14, 2019

/retest

@cgwalters
Member

This is awesome - really nice work! From my PoV good to merge now.

/approve

/assign kikisdeliveryservice
for another review and merge.

Member

Don't need to do it in this PR, but let's start trying to add some doc comments to functions like this if we're rewriting them anyway.

Member Author

added a comment, thanks!

Member

Oh, that's an interesting pattern; I didn't realize it was that easy to do "do this if we're returning an error". That's quite hard in other languages.
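
For readers unfamiliar with the Go idiom being referenced: a deferred closure combined with a named error return can run cleanup only when the function is about to return an error. A minimal illustration, not the MCD code itself:

    package sketch

    import "fmt"

    // applyWithRollback shows "do this if we're returning an error": the deferred
    // closure reads the named return value err and only runs rollback when it is
    // non-nil, optionally wrapping a rollback failure into the returned error.
    func applyWithRollback(apply, rollback func() error) (err error) {
        defer func() {
            if err != nil {
                if rbErr := rollback(); rbErr != nil {
                    err = fmt.Errorf("%v (rollback also failed: %v)", err, rbErr)
                }
            }
        }()
        return apply()
    }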

Member

Very cool!

Member Author

@runcom runcom Mar 14, 2019

This functionality looks like it's working fine: I added an error in the MCD simulating a... error, and it goes back and rewrites the files from oldConfig just fine (I expect the same to work for SSH keys). In the logs below, notice how the file "/etc/chrony.conf" is no longer written once we write from the old config.

I0314 16:51:27.336603    6697 update.go:187] Checking reconcilable for config worker-2231f5479a086534c246ec862625a3f1 to worker-41c30245368d268544394d45a2e70598
I0314 16:51:27.336628    6697 update.go:245] Checking if configs are reconcilable
I0314 16:51:27.338097    6697 update.go:385] Updating files
I0314 16:51:27.338118    6697 update.go:586] Writing file "/etc/sysconfig/crio-network"
I0314 16:51:27.344087    6697 update.go:586] Writing file "/var/lib/kubelet/config.json"
I0314 16:51:27.347939    6697 update.go:586] Writing file "/etc/kubernetes/ca.crt"
I0314 16:51:27.356594    6697 update.go:586] Writing file "/etc/sysctl.d/forward.conf"
I0314 16:51:27.359890    6697 update.go:586] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0314 16:51:27.362261    6697 update.go:586] Writing file "/etc/containers/registries.conf"
I0314 16:51:27.365901    6697 update.go:586] Writing file "/etc/containers/storage.conf"
I0314 16:51:27.369021    6697 update.go:586] Writing file "/etc/crio/crio.conf"
I0314 16:51:27.372757    6697 update.go:586] Writing file "/etc/kubernetes/kubelet.conf"
I0314 16:51:27.376479    6697 update.go:586] Writing file "/etc/chrony.conf"
I0314 16:51:27.380001    6697 update.go:523] Writing systemd unit "kubelet.service"
I0314 16:51:27.383001    6697 update.go:558] Enabling systemd unit "kubelet.service"
I0314 16:51:27.383052    6697 update.go:476] /etc/systemd/system/multi-user.target.wants/kubelet.service already exists. Not making a new symlink
I0314 16:51:27.383063    6697 update.go:404] Deleting stale data
I0314 16:51:27.383084    6697 update.go:648] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0314 16:51:27.385047    6697 daemon.go:1009] Current and target osImageURL have matching digest "sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670"
I0314 16:51:27.385071    6697 update.go:648] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0314 16:51:27.386892    6697 update.go:385] Updating files
I0314 16:51:27.386910    6697 update.go:586] Writing file "/etc/sysconfig/crio-network"
I0314 16:51:27.388912    6697 update.go:586] Writing file "/var/lib/kubelet/config.json"
I0314 16:51:27.393087    6697 update.go:586] Writing file "/etc/kubernetes/ca.crt"
I0314 16:51:27.396004    6697 update.go:586] Writing file "/etc/sysctl.d/forward.conf"
I0314 16:51:27.398847    6697 update.go:586] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0314 16:51:27.400039    6697 update.go:586] Writing file "/etc/containers/registries.conf"
I0314 16:51:27.402039    6697 update.go:586] Writing file "/etc/containers/storage.conf"
I0314 16:51:27.407845    6697 update.go:586] Writing file "/etc/crio/crio.conf"
I0314 16:51:27.413978    6697 update.go:586] Writing file "/etc/kubernetes/kubelet.conf"
I0314 16:51:27.417781    6697 update.go:523] Writing systemd unit "kubelet.service"
I0314 16:51:27.420848    6697 update.go:558] Enabling systemd unit "kubelet.service"
I0314 16:51:27.420897    6697 update.go:476] /etc/systemd/system/multi-user.target.wants/kubelet.service already exists. Not making a new symlink
I0314 16:51:27.420908    6697 update.go:404] Deleting stale data
I0314 16:51:27.421027    6697 daemon.go:396] Unable to apply update: test
E0314 16:51:27.421053    6697 writer.go:98] marking degraded due to: test

Contributor

So, to understand: this tried to apply the first config, and since for whatever reason it can't, it goes back to writing the previous config that didn't have the chrony.conf and annotates the node as degraded?

Member Author

Yep, and it keeps trying to apply the new config, as expected of a "controller" that keeps retrying (even if in this case the obvious fix would be to remove the MC, create a new correct one, and reconcile; this was just a smoke test).

Contributor

So going forward, to recover from a bad MC is it just: make a new MC, or make a new MC and remove the degraded annotation? (I think we should probably write this down for users.)

Member Author

Recovering from a bad MC is just: 1) delete the bad MC first, 2) create a new MC with something "good" that applies.

Member Author

I'm wondering if we should, in the future, react to deletions and just undo the generated MachineConfigs altogether.

Member Author

I do actually believe that deleting an MC triggers a regeneration of the generated MC for the pool (according to the render_controller code).

runcom added 3 commits March 14, 2019 17:10
Effectively avoiding any error like this:

E0314 14:49:12.734384    5690 event.go:259] Could not construct reference to: '&v1.Node{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"ip-10-0-147-217.us-west-2.compute.internal", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.NodeSpec{PodCIDR:"", ProviderID:"", Unschedulable:false, Taints:[]v1.Taint(nil), ConfigSource:(*v1.NodeConfigSource)(nil), DoNotUse_ExternalID:""}, Status:v1.NodeStatus{Capacity:v1.ResourceList(nil), Allocatable:v1.ResourceList(nil), Phase:"", Conditions:[]v1.NodeCondition(nil), Addresses:[]v1.NodeAddress(nil), DaemonEndpoints:v1.NodeDaemonEndpoints{KubeletEndpoint:v1.DaemonEndpoint{Port:0}}, NodeInfo:v1.NodeSystemInfo{MachineID:"", SystemUUID:"", BootID:"", KernelVersion:"", OSImage:"", ContainerRuntimeVersion:"", KubeletVersion:"", KubeProxyVersion:"", OperatingSystem:"", Architecture:""}, Images:[]v1.ContainerImage(nil), VolumesInUse:[]v1.UniqueVolumeName(nil), VolumesAttached:[]v1.AttachedVolume(nil), Config:(*v1.NodeConfigStatus)(nil)}}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'Reboot' 'Node will reboot into config worker-47d52401530ce05855ba991fbb59df12'

Signed-off-by: Antonio Murdaca <[email protected]>
@kikisdeliveryservice
Contributor

Awesome, I'll look at this today!

@kikisdeliveryservice
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 14, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, kikisdeliveryservice, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,kikisdeliveryservice,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom
Member Author

runcom commented Mar 14, 2019

AWS timeout

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 8badaae into openshift:master Mar 15, 2019
@kikisdeliveryservice
Contributor

merged! let the fun begin! 🎉

@smarterclayton
Contributor

This might have broken upgrade (we had a run of green upgrades, then this edge went red:
https://openshift-release.svc.ci.openshift.org/releasetag/4.0.0-0.ci-2019-03-15-021425?from=4.0.0-0.ci-2019-03-15-004920).

We'll monitor and let you know if it did, and I'll also open a PR for you in the morning to add the e2e-aws-upgrade job to the repo so you can test yourself (it's in a slow roll).

@smarterclayton
Contributor

Cautious no - things started greening again after.

@runcom
Member Author

runcom commented Mar 15, 2019

Alright, I've already identified some regressions with this PR, but I'm working on fixing them and adding test coverage as well (this change was needed anyway).
